{
  "cells": [
    {
      "cell_type": "markdown",
      "id": "ef331be9",
      "metadata": {
        "id": "ef331be9"
      },
      "source": [
        "![nebullvm nebuly AI accelerate inference optimize DeepLearning](https://user-images.githubusercontent.com/38586138/201391643-a80407e5-2c28-409c-90c9-327795cd27e8.png)"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "id": "f260653a",
      "metadata": {
        "id": "f260653a"
      },
      "source": [
        "# Accelerate Hugging Face PyTorch DistilBERT with Speedster\n"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "8bdf3af5",
      "metadata": {
        "id": "8bdf3af5"
      },
      "source": [
        "Hi and welcome 👋\n",
        "\n",
        "In this notebook we will discover how in just a few steps you can speed up the response time of deep learning model inference using the Speedster app from the open-source library nebullvm.\n",
        "\n",
        "With Speedster's latest API, you can speed up models up to 10 times without any loss of accuracy (option A), or accelerate them up to 20-30 times by setting a self-defined amount of accuracy/precision that you are willing to trade off to get even lower response time (option B). To accelerate your model, Speedster takes advantage of various optimization techniques such as deep learning compilers (in both option A and option B), quantization, half accuracy, and so on (option B).\n",
        "\n",
        "Let's jump to the code."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "d527d63b",
      "metadata": {
        "id": "d527d63b"
      },
      "outputs": [],
      "source": [
        "%env CUDA_VISIBLE_DEVICES=0"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "cXXh1ifQ13mH",
      "metadata": {
        "id": "cXXh1ifQ13mH"
      },
      "source": [
        "# Installation"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "48aljCHu14-H",
      "metadata": {
        "id": "48aljCHu14-H"
      },
      "source": [
        "Install Speedster:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "QFQh3BVr1-GO",
      "metadata": {
        "id": "QFQh3BVr1-GO"
      },
      "outputs": [],
      "source": [
        "!pip install speedster"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "8a7a86b3",
      "metadata": {
        "id": "8a7a86b3"
      },
      "source": [
        "Install deep learning compilers:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "cffbfa32",
      "metadata": {
        "id": "cffbfa32"
      },
      "outputs": [],
      "source": [
        "!python -m nebullvm.installers.auto_installer --frameworks huggingface --compilers all"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "73072506",
      "metadata": {
        "id": "73072506"
      },
      "source": [
        "## Model and Dataset setup"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "id": "cf24c4c4",
      "metadata": {},
      "source": [
        "Add tensorrt installation path to the LD_LIBRARY_PATH env variable, in order to activate TensorrtExecutionProvider for ONNXRuntime"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "1cf8ff74",
      "metadata": {},
      "outputs": [],
      "source": [
        "import os\n",
        "\n",
        "tensorrt_path = \"/usr/local/lib/python3.8/dist-packages/tensorrt\"  # Change this path according to your TensorRT location\n",
        "\n",
        "if os.path.exists(tensorrt_path):\n",
        "    os.environ['LD_LIBRARY_PATH'] += f\":{tensorrt_path}\"\n",
        "else:\n",
        "    print(\"Unable to find TensorRT path. ONNXRuntime won't use TensorrtExecutionProvider.\")"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "e4d55115",
      "metadata": {
        "id": "e4d55115"
      },
      "source": [
        "We chose DistilBERT as the pre-trained model that we want to optimize. Let's download both the pre-trained model and the tokenizer from the Hugging Face model hub."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "d633cf21",
      "metadata": {
        "id": "d633cf21",
        "scrolled": true
      },
      "outputs": [],
      "source": [
        "import torch\n",
        "from transformers import DistilBertTokenizer, DistilBertModel\n",
        "\n",
        "tokenizer = DistilBertTokenizer.from_pretrained(\"distilbert-base-uncased\")\n",
        "model = DistilBertModel.from_pretrained(\"distilbert-base-uncased\", torchscript=True)\n",
        "\n",
        "# Move the model to gpu if available and set eval mode\n",
        "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
        "model.to(device).eval()"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "11aa0739",
      "metadata": {
        "id": "11aa0739"
      },
      "source": [
        "Let's create an example dataset with some random sentences"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "cbbfeeb2",
      "metadata": {
        "id": "cbbfeeb2"
      },
      "outputs": [],
      "source": [
        "import random\n",
        "\n",
        "sentences = [\n",
        "    \"Mars is the fourth planet from the Sun.\",\n",
        "    \"has a crust primarily composed of elements\",\n",
        "    \"However, it is unknown\",\n",
        "    \"can be viewed from Earth\",\n",
        "    \"It was the Romans\",\n",
        "]\n",
        "\n",
        "len_dataset = 100\n",
        "\n",
        "texts = []\n",
        "for _ in range(len_dataset):\n",
        "    n_times = random.randint(1, 30)\n",
        "    texts.append(\" \".join(random.choice(sentences) for _ in range(n_times)))"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "a09f9424",
      "metadata": {
        "id": "a09f9424"
      },
      "outputs": [],
      "source": [
        "encoded_inputs = [tokenizer(text, return_tensors=\"pt\") for text in texts]"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "17040431",
      "metadata": {
        "id": "17040431"
      },
      "source": [
        "## Speed up inference with Speedster: no metric drop"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "44ddc21d",
      "metadata": {
        "id": "44ddc21d"
      },
      "source": [
        "It's now time of improving a bit the performance in terms of speed. Let's use `Speedster`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "f9d934f6",
      "metadata": {
        "id": "f9d934f6"
      },
      "outputs": [],
      "source": [
        "from speedster import optimize_model, save_model, load_model"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "76248033",
      "metadata": {
        "id": "76248033"
      },
      "source": [
        "Using Speedster is very simple and straightforward! Just use the `optimize_model` function and provide as input the model, some input data as example and the optimization time mode. Optionally a dynamic_info dictionary can be also provided, in order to support inputs with dynamic shape."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "zPC_EDwEJIM0",
      "metadata": {
        "id": "zPC_EDwEJIM0"
      },
      "outputs": [],
      "source": [
        "dynamic_info = {\n",
        "    \"inputs\": [\n",
        "        {0: 'batch', 1: 'num_tokens'},\n",
        "        {0: 'batch', 1: 'num_tokens'}\n",
        "    ],\n",
        "    \"outputs\": [\n",
        "        {0: 'batch', 1: 'num_tokens'}\n",
        "    ]\n",
        "}\n",
        "\n",
        "optimized_model = optimize_model(\n",
        "    model=model,\n",
        "    input_data=encoded_inputs,\n",
        "    optimization_time=\"constrained\",\n",
        "    ignore_compilers=[\"tensor_rt\", \"tvm\"],  # TensorRT does not work for this model\n",
        "    dynamic_info=dynamic_info,\n",
        ")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "98c6ab09",
      "metadata": {
        "id": "98c6ab09"
      },
      "outputs": [],
      "source": [
        "import time\n",
        "\n",
        "# Move inputs to gpu if available\n",
        "encoded_inputs = [tokenizer(text, return_tensors=\"pt\").to(device) for text in texts]"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "6e5b3b21",
      "metadata": {
        "id": "6e5b3b21"
      },
      "source": [
        "Let's run the prediction 100 times to calculate the average response time of the original model."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "d3bc5c98",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "d3bc5c98",
        "outputId": "e0596cf2-fa96-4c50-c012-f5cdab82e681"
      },
      "outputs": [],
      "source": [
        "times = []\n",
        "\n",
        "# Warmup for 30 iterations\n",
        "for encoded_input in encoded_inputs[:30]:\n",
        "    with torch.no_grad():\n",
        "        final_out = model(**encoded_input)\n",
        "\n",
        "# Benchmark\n",
        "for encoded_input in encoded_inputs:\n",
        "    st = time.time()\n",
        "    with torch.no_grad():\n",
        "        final_out = model(**encoded_input)\n",
        "    times.append(time.time()-st)\n",
        "original_model_time = sum(times)/len(times)*1000\n",
        "print(f\"Average response time for original DistilBERT: {original_model_time} ms\")"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "12c2df98",
      "metadata": {
        "id": "12c2df98"
      },
      "source": [
        "Let's see the output of the original model"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "4892a905",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "4892a905",
        "outputId": "68d9b65f-e2cc-4998-8047-c9091f977698"
      },
      "outputs": [],
      "source": [
        "model(**encoded_input)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "3db0a7a1",
      "metadata": {
        "id": "3db0a7a1"
      },
      "source": [
        "Let's run the prediction 100 times to calculate the average response time of the optimized model."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "a3e83997",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "a3e83997",
        "outputId": "7a416b14-f170-4df9-d416-026f06a7d980"
      },
      "outputs": [],
      "source": [
        "times = []\n",
        "\n",
        "# Warmup for 30 iterations\n",
        "for encoded_input in encoded_inputs[:30]:\n",
        "    with torch.no_grad():\n",
        "        final_out = optimized_model(**encoded_input)\n",
        "\n",
        "# Benchmark\n",
        "for encoded_input in encoded_inputs:\n",
        "    st = time.time()\n",
        "    with torch.no_grad():\n",
        "        final_out = optimized_model(**encoded_input)\n",
        "    times.append(time.time()-st)\n",
        "optimized_model_time = sum(times)/len(times)*1000\n",
        "print(f\"Average response time for optimized DistilBERT (no metric drop): {optimized_model_time} ms\")"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "0d884d61",
      "metadata": {
        "id": "0d884d61"
      },
      "source": [
        "Let's see the output of the optimized_model"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "75611b2e",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "75611b2e",
        "outputId": "035d5c6d-fd7a-4506-af09-befcf9dd3b2d"
      },
      "outputs": [],
      "source": [
        "optimized_model(**encoded_input)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "ceb60d8c",
      "metadata": {
        "id": "ceb60d8c"
      },
      "source": [
        "## Speed up inference with Speedster: metric drop"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "7b1950d5",
      "metadata": {
        "id": "7b1950d5"
      },
      "source": [
        "This time we will use the `metric_drop_ths` argument to accept a little drop in terms of precision, in order to enable quantization and obtain an higher speedup"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "de5721d8",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "de5721d8",
        "outputId": "c9efff21-f963-47ff-e83d-a44615f90a10"
      },
      "outputs": [],
      "source": [
        "optimized_model = optimize_model(\n",
        "    model=model,\n",
        "    input_data=encoded_inputs,\n",
        "    optimization_time=\"constrained\",\n",
        "    ignore_compilers=[\"tensor_rt\", \"tvm\"],  # TensorRT does not work for this model\n",
        "    dynamic_info=dynamic_info,\n",
        "    metric_drop_ths=0.1,\n",
        ")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "0fbfe6fa",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "0fbfe6fa",
        "outputId": "ada293f5-9b54-4186-8e48-74b994d4b797"
      },
      "outputs": [],
      "source": [
        "times = []\n",
        "\n",
        "# Warmup for 30 iterations\n",
        "for encoded_input in encoded_inputs[:30]:\n",
        "    with torch.no_grad():\n",
        "        final_out = model(**encoded_input)\n",
        "\n",
        "# Benchmark\n",
        "for encoded_input in encoded_inputs:\n",
        "    st = time.time()\n",
        "    with torch.no_grad():\n",
        "        final_out = model(**encoded_input)\n",
        "    times.append(time.time()-st)\n",
        "original_model_time = sum(times)/len(times)*1000\n",
        "print(f\"Average response time for original DistilBERT: {original_model_time} ms\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "f89b7e6d",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "f89b7e6d",
        "outputId": "51e497e1-a533-432d-d68e-b373f0ef69cb"
      },
      "outputs": [],
      "source": [
        "model(**encoded_input)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "10d17b5c",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "10d17b5c",
        "outputId": "d5dc0acd-77e7-4054-b455-19343ff37951"
      },
      "outputs": [],
      "source": [
        "times = []\n",
        "\n",
        "# Warmup for 30 iterations\n",
        "for encoded_input in encoded_inputs[:30]:\n",
        "    with torch.no_grad():\n",
        "        final_out = optimized_model(**encoded_input)\n",
        "\n",
        "# Benchmark\n",
        "for encoded_input in encoded_inputs:\n",
        "    st = time.time()\n",
        "    with torch.no_grad():\n",
        "        final_out = optimized_model(**encoded_input)\n",
        "    times.append(time.time()-st)\n",
        "optimized_model_time = sum(times)/len(times)*1000\n",
        "print(f\"Average response time for optimized DistilBERT (metric drop): {optimized_model_time} ms\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "6bf3d1fb",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "6bf3d1fb",
        "outputId": "6163d8ba-254f-47d2-a468-a921622a15ba"
      },
      "outputs": [],
      "source": [
        "optimized_model(**encoded_input)"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "id": "ceb60d8c",
      "metadata": {
        "id": "ceb60d8c"
      },
      "source": [
        "## Save and reload the optimized model"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "id": "d9eda1a0",
      "metadata": {},
      "source": [
        "We can easily save to disk the optimized model with the following line:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "62b6fcbf",
      "metadata": {},
      "outputs": [],
      "source": [
        "save_model(optimized_model, \"model_save_path\")"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "id": "3c968d51",
      "metadata": {},
      "source": [
        "We can then load again the model:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "c1340c49",
      "metadata": {},
      "outputs": [],
      "source": [
        "optimized_model = load_model(\"model_save_path\")"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "cb234e5e",
      "metadata": {
        "id": "cb234e5e"
      },
      "source": [
        "Great! Was it easy? How are the results? Do you have any comments?\n",
        "Share your optimization results and thoughts with <a href=\"https://discord.gg/RbeQMu886J\" target=\"_blank\"> our community on Discord</a>, where we chat about Speedster and AI acceleration.\n",
        "\n",
        "Note that the acceleration of Speedster depends very much on the hardware configuration and your AI model. Given the same input model, Speedster can accelerate it by 10 times on some machines and perform poorly on others.\n",
        "\n",
        "If you want to learn more about how Speedster works, look at other tutorials and performance benchmarks, check out the links below or write to us on Discord."
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "id": "b77ff2ac",
      "metadata": {
        "id": "b77ff2ac"
      },
      "source": [
        "<center> \n",
        "    <a href=\"https://discord.com/invite/RbeQMu886J\" target=\"_blank\" style=\"text-decoration: none;\"> Join the community </a> |\n",
        "    <a href=\"https://nebuly.gitbook.io/nebuly/welcome/questions-and-contributions\" target=\"_blank\" style=\"text-decoration: none;\"> Contribute to the library </a>\n",
        "</center>\n",
        "\n",
        "<center> \n",
        "    <a href=\"https://github.com/nebuly-ai/nebullvm/tree/main/apps/accelerate/speedster#key-concepts\" target=\"_blank\" style=\"text-decoration: none;\"> How speedster works </a> •\n",
        "    <a href=\"https://github.com/nebuly-ai/nebullvm/tree/main/apps/accelerate/speedster#documentation\" target=\"_blank\" style=\"text-decoration: none;\"> Documentation </a> •\n",
        "    <a href=\"https://github.com/nebuly-ai/nebullvm/tree/main/apps/accelerate/speedster#quick-start\" target=\"_blank\" style=\"text-decoration: none;\"> Quick start </a> \n",
        "</center>"
      ]
    }
  ],
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "collapsed_sections": [],
      "provenance": []
    },
    "gpuClass": "premium",
    "kernelspec": {
      "display_name": "Python 3.8.10 64-bit",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.8.9 (default, Apr 13 2022, 08:48:06) \n[Clang 13.1.6 (clang-1316.0.21.2.5)]"
    },
    "vscode": {
      "interpreter": {
        "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
      }
    }
  },
  "nbformat": 4,
  "nbformat_minor": 5
}
